PCA and t-SNE Project: Auto MPG

Marks: 30

Welcome to the project on PCA and t-SNE. In this project, we will be using the auto-mpg dataset.


Context


The shifting market conditions, globalization, cost pressure, and volatility are leading to a change in the automobile market landscape. The emergence of data, in conjunction with machine learning in automobile companies, has paved a way that is helping bring operational and business transformations.

The automobile market is vast and diverse, with numerous vehicle categories being manufactured and sold with varying configurations of attributes such as displacement, horsepower, and acceleration. We aim to find combinations of these features that can clearly distinguish certain groups of automobiles from others through this analysis, as this will inform other downstream processes for any organization aiming to sell each group of vehicles to a slightly different target audience.

You are a Data Scientist at SecondLife, a leading used car dealership with numerous outlets across the US. Recently, the company has started shifting its focus to vintage cars and has been diligently collecting data about all the vintage cars it has sold over the years. The Director of Operations at SecondLife wants to leverage the data to extract insights about the cars and find different groups of vintage cars in order to target audiences more efficiently.


Objective


The objective of this problem is to explore the data, reduce the number of features by using dimensionality reduction techniques like PCA and t-SNE, and extract meaningful insights.


Dataset


There are 8 variables in the data:

Brief Insight about the Dataset:

The dataset contains information about various vintage cars, with each row representing a car and each column capturing specific characteristics of the cars. These features include factors like miles per gallon (mpg), number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and car name. Since this is an unsupervised learning task, there is no target variable, and all the features are considered independent. The goal is to explore and group the cars based on these characteristics to uncover meaningful patterns and insights.

Importing the necessary libraries

Loading the data

Data Overview

Purpose: This is the process of exploring and getting familiar with the dataset.

Observation:

Insights

Sanity Check:

Purpose: This is the process of validating the dataset to ensure it is logical, consistent, and usable.

Actions:

1) There are no missing values in any of the columns in the dataset.
2) According to the info() function, we notice float64(3), int64(3), object(2) columns. Note: Although the horsepower column contains integer values, the info() function indicates it is stored as an object data type, which needs to be corrected for proper numerical analysis.
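
The two checks above can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame; the three-row `df` is a hypothetical stand-in, not the actual auto-mpg file.

```python
import pandas as pd

# Hypothetical three-row stand-in for the auto-mpg data (not the real file).
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0],
    "cylinders": [8, 8, 8],
    "horsepower": ["130", "165", "150"],  # numbers stored as strings
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = df.isnull().sum()

# A column of digit strings is not a numeric dtype, even though it looks numeric.
horsepower_is_numeric = pd.api.types.is_numeric_dtype(df["horsepower"])

print(missing_per_column)
print(df.dtypes)
print("horsepower numeric?", horsepower_is_numeric)
```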

Observation:

The output shown doesn't have any obvious non-numeric values. All the entries are numbers represented as strings. This can happen if the column has been incorrectly read as a string (object) type rather than as numbers.

Observation:

The output shows 'float64' as a datatype for column horsepower.
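
One way the check and conversion described above might look is sketched below. The `hp` series is a hypothetical sample, and `pd.to_numeric` is only one possible way to do the conversion; the notebook's own code is not preserved here.

```python
import pandas as pd

# Hypothetical sample of a horsepower column read in as strings.
hp = pd.Series(["130", "165", "150", "95"], name="horsepower")

# Keep only entries that are NOT plain digit strings; empty means all look numeric.
non_numeric = hp[~hp.str.isdigit()]

# Convert to float64; errors="coerce" would turn any unparseable entry into NaN.
hp_numeric = pd.to_numeric(hp, errors="coerce").astype("float64")

print(non_numeric.empty, hp_numeric.dtype)
```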

Observation

Observation

This information can help in segmenting these cars into different audience groups based on what potential buyers value (e.g., fuel efficiency vs. performance).

Data Preprocessing and Exploratory Data Analysis

Checking Correlation among the variables
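
A minimal sketch of a correlation check is shown below, assuming pandas. The data are synthetic (a made-up negative relationship between weight and mpg) and the column names are chosen only to mirror the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: mpg falls as weight rises, plus a little noise.
weight = rng.uniform(1600, 5000, size=50)
df = pd.DataFrame({
    "weight": weight,
    "mpg": 45 - 0.007 * weight + rng.normal(0, 1, size=50),
    "car name": ["car"] * 50,  # non-numeric column, excluded below
})

# Pearson correlation matrix over the numeric columns only.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

A heatmap of `corr` is a common way to read such a matrix at a glance.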

Observations:

Checking Outliers
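
The sketch below flags outliers with the 1.5 x IQR rule, which is the convention boxplots typically use for their whiskers. The function name `iqr_outliers` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up sample with one value far above the rest.
sample = [8.0, 9.5, 10.0, 10.5, 11.0, 12.0, 24.8]
outliers = iqr_outliers(sample)
print(outliers)
```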

Observations on Outliers in the Numerical Features:

Missing value treatment

Observations on the boxplot display of the horsepower feature:

Summary Statistics

Observations:

Observation: There is no missing data.

Observation:

  1. Frequency Distribution: The count plot reveals the distribution of different car names in the dataset, allowing us to identify which models are most and least prevalent.

  2. Popular Car Models: Certain car names are significantly more common than others, indicating that these models may have higher sales or popularity during the time frame of the dataset. Ford Pinto has the maximum count.

  3. Insights on Trends: This analysis could provide insights into market trends, brand popularity, and consumer preferences for specific car models within the dataset.

Insights:

Trends Over Time: The scatter plot visualizes how different car models relate to their manufacturing years, indicating trends in the automotive industry over time. This can help identify which models were popular in specific decades.

Concentration of Models: There may be clusters of car names corresponding to certain model years, suggesting that specific models were produced during particular time periods. This can indicate brand strategies or market demands.

Variation in Production Years: Some car names may show a broader range of production years, while others may have a more concentrated production span, hinting at model longevity or discontinuation.

Discontinuation of Models: A noticeable gap in the scatter plot for certain car names could indicate models that were discontinued after a certain year, pointing to changing consumer preferences or manufacturer strategies.

Market Evolution: The plot can help illustrate the evolution of car design and technology over the years, with certain car names reflecting historical or stylistic changes in the industry.

Pairwise Relationships:

The pair plot provides a visual representation of the relationships between each pair of features in the dataset. By examining these scatter plots, we can identify potential correlations or patterns between variables.

Scaling the data

Why do we need to scale the data?
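
The answer to this question is not preserved here. One commonly cited reason, offered as an assumption rather than as the author's own wording, is that PCA and t-SNE operate on variances and distances, so a feature measured in large units (such as weight) would otherwise dominate features measured in small units (such as acceleration). The sketch below standardizes two made-up columns to zero mean and unit variance.

```python
import numpy as np

# Hypothetical rows: [weight, acceleration] on very different scales.
X = np.array([
    [3504.0, 12.0],
    [3693.0, 11.5],
    [2372.0, 15.0],
    [2130.0, 14.5],
])

# Before scaling, the variance of the first column dwarfs the second.
raw_variances = X.var(axis=0)

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(raw_variances)
print(X_scaled.round(2))
```

scikit-learn's `StandardScaler` applies the same column-wise transformation.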

Principal Component Analysis

PCA (Principal Component Analysis) helps when some of the features in the data are strongly correlated. Correlated features carry largely duplicate information, which can distort distance-based algorithms such as clustering.

PCA addresses this by constructing new features, called principal components, that capture the most variation in the data. The principal components are uncorrelated with each other, which removes the overlap of information.

This makes clustering easier and often more reliable, because it focuses on the main patterns in the data rather than on redundant features.

Observations: PCA reduces the original 8 features to 4 principal components, which together explain approximately 94% of the original variance. That is about a 50% reduction in the dimensionality of the dataset for a loss of less than 10% of the variance.
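
The way a component count can be read off the cumulative explained variance is sketched below, assuming scikit-learn. The data are synthetic (7 columns driven by 2 latent factors) and the 90% threshold is an illustrative choice, so the resulting count is not the notebook's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: 7 observed columns generated from 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 7)) + 0.3 * rng.normal(size=(200, 7))

# Standardize, then fit PCA with all components kept.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Running total of the variance share explained by the first k components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the (illustrative) 90% threshold.
threshold = 0.90
n_components = int(np.searchsorted(cum_var, threshold) + 1)

print(cum_var.round(3), n_components)
```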

Observations:

Interpret the coefficients of the first three principal components from the below DataFrame

Observations:

1) Principal Component 1 (PC1): High Positive Loadings: Cylinders (0.932776), Displacement (0.962400), Horsepower (0.945823), and Weight (0.927712) all have strong positive coefficients. This suggests that PC1 is primarily influenced by these features. High Negative Loading: Miles per Gallon (mpg, -0.890788) and Acceleration (-0.637912) have negative coefficients. This indicates that as the values of these features increase, the value of PC1 decreases.

Insight: PC1 seems to represent a performance axis of vehicles where higher horsepower, weight, and displacement correlate with lower fuel efficiency (mpg) and acceleration. This could imply that higher performance vehicles tend to have larger engines and are heavier but are less fuel-efficient.

2) Principal Component 2 (PC2): High Positive Loading: Model Year (0.848212) has a strong positive loading, suggesting that more recent models tend to contribute positively to PC2. Moderate Positive Loadings: Weight (0.206761) and Cylinders (0.178494) have lower positive coefficients, indicating a lesser but still notable influence. Low Loadings for Others: Other features have low or near-zero loadings, indicating they are not significantly contributing to this component.

Insight: PC2 may represent a temporal aspect where newer cars tend to have different characteristics compared to older models. This could be related to advancements in technology, design, or regulatory changes affecting vehicle performance.

3) Principal Component 3 (PC3): High Positive Loading: Acceleration (0.763104) has a strong positive loading, indicating that higher acceleration values contribute positively to this component. Moderate Positive Loadings: Weight (0.239082) also shows a moderate positive influence, suggesting a relationship between acceleration and vehicle weight. Negative Loadings: Horsepower (-0.143674) and mpg (-0.219344) have negative coefficients, indicating that as horsepower and fuel efficiency increase, the value of PC3 decreases.

Insight: PC3 appears to capture a balance between acceleration and power efficiency. This could indicate that cars that accelerate quickly may not necessarily be the most powerful or fuel-efficient.
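
A coefficient table like the one interpreted above can be sketched as follows, assuming scikit-learn. The squares of the PC1 values quoted above sum to more than 1, which unit-length eigenvectors cannot do, so this sketch assumes the coefficients were scaled by the square root of each component's variance (making them roughly feature-to-component correlations). That scaling is an assumption, and the data below are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["mpg", "cylinders", "displacement", "horsepower",
            "weight", "acceleration", "model year"]

# Synthetic data: a shared "size" factor raises four columns and lowers mpg.
rng = np.random.default_rng(0)
size = rng.normal(size=150)
X = 0.5 * rng.normal(size=(150, 7))
X[:, 1:5] += size[:, None]
X[:, 0] -= size

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)

# Rows = features, columns = components; each column has unit length.
unit_vectors = pca.components_.T

# Assumed scaling: multiply each column by that component's standard deviation.
loadings = unit_vectors * np.sqrt(pca.explained_variance_)

coef = pd.DataFrame(loadings, index=features, columns=["PC1", "PC2", "PC3"])
print(coef.round(2))
```

The sign of each component is arbitrary, so only the relative signs within a column carry meaning.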

Overall Insights

1) Understanding Clusters:

These components can help identify distinct groups (or clusters) of vehicles based on performance, efficiency, and modernity. For instance, you could cluster vehicles that are high in PC1 and low in PC2 to find older high-performance vehicles.

2) Performance vs. Efficiency: The strong negative correlation between performance features (like horsepower and weight) and mpg in PC1 suggests a trade-off between performance and fuel efficiency. This insight can be useful for manufacturers focusing on performance vehicles while also considering environmental regulations.

3) Market Segmentation: Insights from PC2 can help marketers target different consumer segments, emphasizing modern features for newer vehicles while highlighting performance attributes for older models.

4) Product Development: Understanding how features contribute to these principal components can guide product development. For example, if a new model is being developed, balancing weight and horsepower could be key for achieving desirable acceleration and fuel efficiency metrics.

5) Future Considerations: As technology evolves, the relationship between these features may change. Continual analysis using PCA can help monitor trends in vehicle performance, efficiency, and consumer preferences over time.

In summary, the coefficients of the principal components provide a deeper understanding of how features interact and influence vehicle characteristics, which can be invaluable for clustering analysis, market strategy, and product development in the automotive industry.

Visualize the data in 2 dimensions using the first two principal components

Observations:

t-SNE

2D visualization

3D visualization

Observations:

We know that t-SNE preserves the local structure of the data while embedding it from high dimensions to low dimensions. Here, we have generated the 2D and 3D embeddings for the data. We can clearly see 3 groups in the data. The points within each group are tightly clustered, with few outliers.
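
Generating 2D and 3D embeddings can be sketched as below, assuming scikit-learn's `TSNE` with its default perplexity. The three synthetic blobs stand in for the scaled car features and are not the notebook's data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Synthetic stand-in: three well-separated blobs in 7 dimensions, 40 points each.
centers = rng.normal(scale=8.0, size=(3, 7))
X = np.vstack([c + rng.normal(size=(40, 7)) for c in centers])

# Embed the same data into 2 and 3 dimensions.
tsne_2d = TSNE(n_components=2, random_state=1).fit_transform(X)
tsne_3d = TSNE(n_components=3, random_state=1).fit_transform(X)

print(tsne_2d.shape, tsne_3d.shape)
```

Fixing `random_state` keeps the layout reproducible between runs; t-SNE is otherwise stochastic.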

Visualize the clusters w.r.t different variables using scatter plot and box plot

Observations: We observe that some perplexity values like 35 and 45 are able to capture the underlying patterns in the data better than other values. This shows that perplexity plays an important role in t-SNE implementation. Let's visualize again with perplexity equal to 35 as there are 3 clear groups which are distant from each other, i.e., well separated.
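
A perplexity comparison of this kind can be sketched as a simple loop, assuming scikit-learn. The perplexity values and the synthetic data are illustrative; in practice each embedding would be plotted and compared by eye, as described above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Synthetic stand-in: three blobs in 7 dimensions, 30 points each.
centers = rng.normal(scale=8.0, size=(3, 7))
X = np.vstack([c + rng.normal(size=(30, 7)) for c in centers])

embeddings = {}
for perplexity in (5, 20, 35, 45):  # perplexity must stay below the sample count
    model = TSNE(n_components=2, perplexity=perplexity, random_state=1)
    embeddings[perplexity] = model.fit_transform(X)
    # The final KL divergence is reported per fit, but it is not directly
    # comparable across different perplexity values.
    print(perplexity, round(float(model.kl_divergence_), 3))
```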

Observations: We can clearly see 3 groups in the data. Let's label these 3 groups using the values of the X1 and X2 axes.
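
Labeling groups from embedding coordinates can be sketched with simple cut points. The function name and the thresholds (-5 and 15) are hypothetical, chosen only to fall between the X1 ranges reported for the three clusters. This sketch uses X1 alone, whereas the notebook may also have used X2.

```python
import numpy as np

def label_groups(x1, cut_low=-5.0, cut_high=15.0):
    """Assign 0, 1, or 2 depending on where X1 falls relative to two cut points."""
    x1 = np.asarray(x1, dtype=float)
    # np.select picks the first matching condition; anything else gets the default.
    return np.select([x1 < cut_low, x1 < cut_high], [0, 1], default=2)

# Made-up X1 coordinates, one or two per expected group.
labels = label_groups([-15.0, 4.0, 25.0, -12.0])
print(labels)
```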

Observations:

The plot shows clear separation between the three clusters (marked in red, green, and blue).

This indicates that the data points in each cluster are more similar to each other than to those in other clusters.

1) Red Cluster (Cluster 0):

This cluster contains the majority of the points and is concentrated on the left side of the plot, spanning approximately X1 = -20 to X1 = -10 and ranging vertically from X2 = -10 to X2 = 10.

The points here might represent a certain group with unique characteristics (e.g., lower values on the x-axis).

2) Green Cluster (Cluster 1):

This cluster appears to be a smaller group located centrally, with points primarily between X1 = 0 to X1 = 10 and X2 = -5 to X2 = 5.

This indicates that these data points may share similar features that differentiate them from the others.

3) Blue Cluster (Cluster 2):

The blue cluster is distinct and spread towards the right side of the plot, around X1 = 20 to X1 = 30 and mostly in the range of X2 = -5 to X2 = 0.

This suggests another unique group with specific properties, potentially indicating higher values along the x-axis.

There appear to be some isolated points, especially in the green and blue clusters. This could indicate outliers or unique observations that may need further investigation.

The distribution of data points indicates varying densities among clusters. For instance, the red cluster seems to have a higher density of points, while the green cluster has a more scattered arrangement.

Each cluster could represent different categories, types, or behaviors in your dataset, warranting further analysis to understand the underlying factors contributing to these separations.

Observations:

1) MPG (Miles per Gallon):

Cluster 0: Represents cars with relatively high fuel efficiency, which is a key characteristic of lighter vehicles or cars with smaller engines. Cluster 1: Cars with lower fuel efficiency, possibly due to larger engines or older technology, reflecting high-displacement engines. Cluster 2: A mix of vehicles, with moderate fuel efficiency but more consistency in mpg.

2) Cylinders: Cluster 0: Contains vehicles with fewer cylinders, likely smaller engines. Cluster 1: A wider distribution of vehicles with varying cylinder counts, indicating the presence of both smaller and larger engine cars. Cluster 2: Homogeneous in cylinder count, possibly a distinct group of similar engine types.

3) Displacement:

Cluster 0: Low engine displacement, typically associated with more fuel-efficient vehicles. Cluster 1: High engine displacement, representing cars with larger engines and likely lower fuel efficiency. Cluster 2: Moderate displacement, possibly representing mid-sized vehicles.

4) Horsepower:

Cluster 1: Cars with significantly higher horsepower, representing powerful vehicles with larger engines. Cluster 0 and 2: Have lower horsepower distributions, possibly representing cars with smaller engines or older models.

5) Weight:

Cluster 1: Heavier cars, which may correlate with larger engines and lower fuel efficiency. Cluster 0: Contains lighter vehicles, which is consistent with higher fuel efficiency. Cluster 2: Represents mid-range vehicles in terms of weight.

6) Acceleration:

Cluster 0: Faster acceleration, possibly due to lighter weight and smaller engines. Cluster 1: Slower acceleration, consistent with the heavier weight and higher horsepower of the vehicles in this group. Cluster 2: Similar to Cluster 0, but with some variation in performance.

7) Model Year:

Cluster 0: Consists of older cars, possibly vintage models. Cluster 1: Represents newer vintage cars with slightly more advanced features. Cluster 2: A mix of years, with some older models.
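
Per-cluster comparisons like those above are usually backed by grouped summary statistics. The sketch below assumes a pandas DataFrame with a hypothetical `cluster` column; every number in it is made up so the example runs and is not a result from the dataset.

```python
import pandas as pd

# Made-up values, two cars per cluster, purely for illustration.
df = pd.DataFrame({
    "mpg": [30.0, 32.0, 14.0, 16.0, 22.0, 24.0],
    "weight": [2000.0, 2200.0, 4200.0, 4400.0, 3000.0, 3200.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean and median of each feature within each cluster.
summary = df.groupby("cluster")[["mpg", "weight"]].agg(["mean", "median"])
print(summary)
```

Box plots of each feature split by `cluster` show the same comparison visually.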

Conclusion

Through this analysis, we identified three distinct clusters of vintage cars, each representing a unique segment of vehicles:

Cluster 0: Likely represents older, lighter, and more fuel-efficient vehicles, making them appealing to customers interested in vintage cars with economical performance.

Cluster 1: Contains heavier, more powerful vehicles with high horsepower, catering to enthusiasts of vintage muscle cars or high-performance vehicles.

Cluster 2: Represents a middle-ground cluster with moderate features, appealing to a broad range of vintage car buyers.

These insights can help SecondLife target specific customer segments based on vehicle characteristics, improving the dealership’s marketing strategy and sales approach.

Actionable Insights:

  1. High Correlation Among Features: Features such as horsepower, weight, and displacement show strong correlations with miles per gallon (mpg). This indicates that optimizing these factors could significantly enhance fuel efficiency.

  2. Categorical Feature Trends: Certain car names demonstrate higher sales or popularity. Analyzing the features of these models could inform future marketing and production decisions.

  3. Skewed Distributions: Variables like mpg and weight exhibit right skewness. Data transformation techniques may be necessary for more accurate modeling.

  4. Distinct Clusters: The t-SNE analysis indicates distinct clusters within the data, suggesting that different customer segments may have unique preferences and behaviors.

Recommendations:

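  1. Feature Optimization: Focus on reducing vehicle weight through engineering improvements, as lower weight is associated with higher mpg. Note that horsepower is negatively associated with mpg in this data (see PC1), so enhancing horsepower is unlikely to improve fuel efficiency on its own.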

  2. Targeted Marketing: Utilize insights from popular car names to tailor marketing strategies and promotions for similar models or features that resonate with consumers.

  3. Data Preprocessing: Apply transformations (e.g., log transformation) to skewed variables to enhance model performance and interpretability.

  4. Segmentation Strategies: Leverage the clustering insights to develop targeted marketing strategies for different customer segments based on their preferences and behaviors identified in the analysis.
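
The log transformation mentioned in recommendation 3 can be sketched as follows. The data are synthetic right-skewed values rather than the actual mpg or weight columns, so the skewness figures are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed, strictly positive variable (log-normal draws).
values = pd.Series(rng.lognormal(mean=8.0, sigma=0.5, size=500))

# Sample skewness before and after taking the natural log.
skew_before = values.skew()
skew_after = np.log(values).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

A log transformation requires strictly positive values; for variables that can be zero, `np.log1p` is a common alternative.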

By implementing these insights and recommendations, the organization can enhance vehicle design, optimize marketing efforts, and ultimately improve customer satisfaction and sales performance.

My insights from the Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) analyses of the automobile data are as follows:

Recommendations and Business Implications for SecondLife:

Performance Cars (Cluster 1): Target high-performance enthusiasts interested in vintage muscle cars or sports cars. They may be willing to compromise on fuel efficiency for power and design.

Economical, Fuel-Efficient Cars (Cluster 0): Focus on customers interested in environmentally friendly, efficient vehicles. These cars would appeal to buyers who value economical usage or have an interest in light, classic models.

Technological or Modern Vintage Cars (Cluster 2): Appeal to customers looking for newer vintage models that offer a blend of performance and technological advancements.

By using these insights, SecondLife can tailor its marketing strategy to focus on each group’s unique characteristics and maximize its appeal to specific customer segments.